{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "x_eqz1Rt1La1"
   },
   "source": [
    "# Dingo - Data Quality Evaluation Tool Demo\n",
    "\n",
    "Dingo is a comprehensive data quality evaluation tool for large models, supporting multiple evaluation methods:\n",
    "- 🤖 **LLM Evaluation**: Intelligent evaluation using large language models\n",
    "- 📏 **Rule-based Evaluation**: Fast evaluation based on predefined rules\n",
    "- 📊 **Dataset Evaluation**: Batch evaluation of entire datasets\n",
    "\n",
    "This demo will show you how to use Dingo for single data evaluation and dataset batch evaluation.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 1000
    },
    "id": "fiCeklhb1OHV",
    "outputId": "1f084a34-5d79-4cb1-912c-9ec344a51b71"
   },
   "outputs": [],
   "source": [
    "!pip install dingo-python"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "jxEvgpMn1ojd"
   },
   "outputs": [],
   "source": [
    "# Import Dingo related modules\n",
    "from dingo.config.input_args import EvaluatorLLMArgs\n",
    "from dingo.io.input import Data\n",
    "from dingo.model.llm.llm_text_quality_model_base import LLMTextQualityModelBase\n",
    "from dingo.model.rule.rule_common import RuleDocRepeat\n",
    "from dingo.config import InputArgs\n",
    "from dingo.exec import Executor\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "g2D9mU601rYY"
   },
   "source": [
    "## 🔍 Example 1: Single LLM Chat Data Evaluation\n",
    "\n",
    "This example shows how to evaluate the quality of single chat data using both LLM and rule-based methods.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "fhSa1iCd1sJH"
   },
   "source": [
    "### 🔑 Setup API Key\n",
    "\n",
    "To use LLM evaluation, you need to provide your OpenAI API key:\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "cvXXaRhh1xiC"
   },
   "source": [
    "### 📝 Prepare Test Data\n",
    "\n",
    "Create a data object containing prompt and content:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "pIK1ueuL1y6d",
    "outputId": "22c121c2-d5f5-4801-8e1d-284b60d38d6d"
   },
   "outputs": [],
   "source": [
    "# Create test data\n",
    "data = Data(\n",
    "    data_id='demo_001',\n",
    "    prompt=\"hello, introduce the world\",\n",
    "    content=\"Hello! The world The world The world The world The world The world The world The world The world The world The world \"\n",
    ")\n",
    "\n",
    "print(f\"Data ID: {data.data_id}\")\n",
    "print(f\"Prompt: {data.prompt}\")\n",
    "print(f\"Content: {data.content}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "_aefU-e211DF"
   },
   "source": [
    "### 🤖 LLM Evaluation\n",
    "\n",
    "Use large language models for intelligent data quality evaluation:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "VITcCtAY12dH",
    "outputId": "f1b05394-ef66-4ade-8922-c5ea1709133d"
   },
   "outputs": [],
   "source": [
    "def llm_evaluation():\n",
    "    \"\"\"Use LLM for data quality evaluation\"\"\"\n",
    "    try:\n",
    "        # Configure LLM parameters\n",
    "        LLMTextQualityModelBase.dynamic_config = EvaluatorLLMArgs(\n",
    "            key='your-key',\n",
    "            api_url='https://api.deepseek.com/v1',\n",
    "            model='deepseek-chat',\n",
    "        )\n",
    "\n",
    "        print(\"🤖 Evaluating data quality using LLM...\")\n",
    "        result = LLMTextQualityModelBase.eval(data)\n",
    "\n",
    "        return result\n",
    "\n",
    "    except Exception as e:\n",
    "        print(f\"❌ LLM evaluation error: {str(e)}\")\n",
    "        print(\"Please check if your API key is correct and network connection is stable.\")\n",
    "        return None\n",
    "\n",
    "# Execute LLM evaluation\n",
    "llm_result = llm_evaluation()\n",
    "print(llm_result)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "XYFFfHw-3sYJ"
   },
   "source": [
    "### 📏 Rule-based Evaluation\n",
    "\n",
    "Use predefined rules for quick evaluation (checking line breaks and spaces):\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "DS4ZXMf_3tNU",
    "outputId": "6b3ace52-b81f-4329-ff6e-34b278b35ec2"
   },
   "outputs": [],
   "source": [
    "def rule_evaluation():\n",
    "    \"\"\"Use rules for data quality evaluation\"\"\"\n",
    "    try:\n",
    "        print(\"📏 Evaluating data quality using rules...\")\n",
    "\n",
    "        # Use line break and space checking rules\n",
    "        rule_checker = RuleDocRepeat()\n",
    "        result = rule_checker.eval(data)\n",
    "\n",
    "        return result\n",
    "\n",
    "    except Exception as e:\n",
    "        print(f\"❌ Rule evaluation error: {str(e)}\")\n",
    "        return None\n",
    "\n",
    "# Execute rule evaluation\n",
    "rule_result = rule_evaluation()\n",
    "print(rule_result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "5rEPEoNm39KE"
   },
   "source": [
    "## 📊 Example 2: Dataset Batch Evaluation\n",
    "\n",
    "This example shows how to perform batch evaluation on entire datasets using datasets from Hugging Face.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 347
    },
    "id": "UolIifHV3-uk",
    "outputId": "8c72b757-7a72-4cf5-efcd-5f2d2603ff22"
   },
   "outputs": [],
   "source": [
    "# Evaluate a dataset from Hugging Face\n",
    "input_data = {\n",
    "    \"eval_group\": \"sft\",           # Rule set for SFT data\n",
    "    \"input_path\": \"chupei/format-text\", # Dataset from Hugging Face\n",
    "    \"data_format\": \"plaintext\",    # Format: plaintext, json, listjson\n",
    "    \"save_data\": True              # Save evaluation results\n",
    "}\n",
    "\n",
    "input_args = InputArgs(**input_data)\n",
    "executor = Executor.exec_map[\"local\"](input_args)\n",
    "result = executor.execute()\n",
    "print(result)"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
